-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[Core] Refactor reference_counter out of memory store and plasma store #57590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| } | ||
|
|
||
| Status CoreWorkerPlasmaStoreProvider::Get( | ||
| const absl::flat_hash_set<ObjectID> &object_ids, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic below lifts the owner address resolution logic into the caller of Get.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a behavior change here. Before, we'd fetch the owner addresses for each batch. Now, we'll get them all up-front in the caller. It's probably fine. Keep an eye on the perf numbers to make sure there isn't a regression tho
src/ray/core_worker/core_worker.cc
Outdated
| } | ||
| } | ||
|
|
||
| absl::flat_hash_map<ObjectID, rpc::Address> CoreWorker::GetObjectIdToOwnerAddressMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function represents the owner address resolution logic lifted out of plasma store.
| for (auto &get_request : get_requests) { | ||
| get_request->Set(object_id, object_entry); | ||
| // If ref counting is enabled, override the removal behaviour. | ||
| if (get_request->ShouldRemoveObjects() && ref_counter_ == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove after get flag is getting removed since it stems from the dark ages when reference counting wasn't a thing.
| const ray::RayObject &object, const ObjectID &object_id)> object_allocator) | ||
| : io_context_(io_context), | ||
| ref_counter_(counter), | ||
| reference_counting_(reference_counting), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All reference counter null checks for whether reference counting is enabled is lifted into the caller.
| void CoreWorkerMemoryStore::Put(const RayObject &object, const ObjectID &object_id) { | ||
| void CoreWorkerMemoryStore::Put(const RayObject &object, | ||
| const ObjectID &object_id, | ||
| const bool has_reference) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Memory store only uses reference counter for checking if it has reference to an object or if reference counting is enabled. Both is lifted into caller.
| explicit DefaultCoreWorkerMemoryStoreWithThread( | ||
| std::unique_ptr<InstrumentedIOContextWithThread> io_context) | ||
| : CoreWorkerMemoryStore(io_context->GetIoService()), | ||
| : CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| : CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/false), | |
| : CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/true), |
There's no reason for the default core worker memory store to be created with reference counting disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not used in any local mode tests, then you should set it to true.
dayshah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haven't looked at this too hard yet, but at first glance it looks like having address in that vector will make wait/get perf on large lists even worse than it already is bc the impact of the vector set copying madness will be even worse
Do we need address to be in there??
Signed-off-by: davik <[email protected]>
…inimize_ref_count_dep
Nope, definitely doesn't need to be in the same data structure. I've moved it out for performance. |
|
Needs to rebase |
9d17cce to
7b55e41
Compare
| explicit DefaultCoreWorkerMemoryStoreWithThread( | ||
| std::unique_ptr<InstrumentedIOContextWithThread> io_context) | ||
| : CoreWorkerMemoryStore(io_context->GetIoService()), | ||
| : CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not used in any local mode tests, then you should set it to true.
|
|
||
| Status CoreWorkerPlasmaStoreProvider::GetObjectsFromPlasmaStore( | ||
| absl::flat_hash_set<ObjectID> &remaining, | ||
| absl::flat_hash_map<ObjectID, int64_t> &remaining_object_id_to_idx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a performance optimization. Since we now take in both the owner id and address in Get() instead of resolving owner address dynamically, we need a way to associate owner_id with its owner address quickly.
| } | ||
|
|
||
| Status CoreWorkerPlasmaStoreProvider::Get( | ||
| const absl::flat_hash_set<ObjectID> &object_ids, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a behavior change here. Before, we'd fetch the owner addresses for each batch. Now, we'll get them all up-front in the caller. It's probably fine. Keep an eye on the perf numbers to make sure there isn't a regression tho
| std::vector<ObjectID> batch_ids; | ||
|
|
||
| int64_t num_total_objects = static_cast<int64_t>(object_ids.size()); | ||
| absl::flat_hash_map<ObjectID, int64_t> remaining_object_id_to_idx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the index for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::vector<ObjectID> object_ids = {return_id}; | ||
| auto owner_addresses = reference_counter_->GetOwnerAddresses(object_ids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above about overloading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::vector<ObjectID> object_ids = | ||
| std::vector<ObjectID>(plasma_object_ids.begin(), plasma_object_ids.end()); | ||
| auto owner_addresses = reference_counter_->GetOwnerAddresses(object_ids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not overload GetOwnerAddresses instead of creating a vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, GetOwnerAddresses associate the output owner address vector with the input object_ids by index. If we take in a set instead, the ordering is now ill-defined. The current change should not cause any performance changes here as we simply moved the object_id vectorization out of plasma_store_provider and into core_worker.cc https://github.com/ray-project/ray/blob/master/src/ray/core_worker/store_provider/plasma_store_provider.cc#L262.
From offline discussion, we've concluded that the current way of storing owner address as a vector is poor cardinality as most object ids likely share the same owner address. However, the change to reduce cardinality will require significant logic changes across multiple files. GetOwnerAddress will be overloaded in a future PR that addresses the cardinality issue.
| hdrs = ["store_provider/plasma_store_provider.h"], | ||
| deps = [ | ||
| ":common", | ||
| ":core_worker_context", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay
…e by default Signed-off-by: davik <[email protected]>
| // Prepare object ids vector and owner addresses vector | ||
| std::vector<ObjectID> object_ids = | ||
| std::vector<ObjectID>(plasma_object_ids.begin(), plasma_object_ids.end()); | ||
| auto owner_addresses = reference_counter_->GetOwnerAddresses(object_ids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Avoid Unnecessary Vector Copies
Creating a vector from a set and then calling GetOwnerAddresses() is inefficient. The PR discussion mentions "Why not overload GetOwnerAddresses instead of creating a vector?" This creates an unnecessary copy of potentially many ObjectIDs. The same issue appears at lines 3093-3094 and 3365-3367, but those involve smaller vectors so the performance impact is less severe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #57590 (comment)
Why are these changes needed?
As discovered in the PR to better define the interface for reference counter, plasma store provider and memory store both share thin dependencies on reference counter that can be refactored out. This will reduce entanglement in our code base and improve maintainability.
The main logic changes are located in
Related issue number
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.